Noise Stability of Transformer Models
Haris, Themistoklis, Zhang, Zihan, Yoshida, Yuichi
Understanding simplicity biases in deep learning offers a promising path toward developing reliable AI. A common metric for this, inspired by Boolean function analysis, is average sensitivity, which captures a model's robustness to single-token perturbations. We argue that average sensitivity has two key limitations: it lacks a natural generalization to real-valued domains and fails to explain the "junta-like" input dependence we empirically observe in modern LLMs. To address these limitations, we propose noise stability as a more comprehensive simplicity metric. Noise stability expresses a model's robustness to correlated noise applied to all input coordinates simultaneously. We provide a theoretical analysis of noise stability for single-layer attention and ReLU MLP layers, and tackle the multi-layer propagation problem with a covariance interval propagation approach. Building on this theory, we develop a practical noise stability regularization method. Experiments on algorithmic and next-token-prediction tasks show that our regularizer consistently catalyzes grokking and accelerates training by approximately 35% and 75%, respectively.

Simplicity biases have been a promising direction of study in recent years (Shah et al., 2020; Vasudeva et al., 2024; Bhattamishra et al., 2022), as they provide a unifying framework for generalization, interpretability, and robustness. Neural networks, including Large Language Models (LLMs), often converge to the simplest possible functions that explain the training data.
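In the Boolean-analysis setting this abstract draws on, noise stability at correlation rho is Stab_rho[f] = E[f(x) f(y)], where y is a rho-correlated copy of x. A minimal Monte-Carlo sketch of this quantity (the function names and sampling scheme are illustrative, not the paper's implementation):

```python
import numpy as np

def noise_stability(f, n_bits, rho, n_samples=20000, seed=0):
    """Monte-Carlo estimate of Stab_rho[f] = E[f(x) * f(y)] over
    x uniform in {-1,+1}^n and y a rho-correlated copy of x:
    each coordinate of y is kept equal to x_i with probability rho
    and resampled uniformly otherwise, so E[x_i * y_i] = rho."""
    rng = np.random.default_rng(seed)
    x = rng.choice([-1, 1], size=(n_samples, n_bits))
    keep = rng.random((n_samples, n_bits)) < rho
    fresh = rng.choice([-1, 1], size=(n_samples, n_bits))
    y = np.where(keep, x, fresh)
    return float(np.mean(f(x) * f(y)))

# The "dictator" function f(x) = x_1 has Stab_rho[f] = rho exactly,
# which makes it a handy sanity check for the estimator.
dictator = lambda x: x[:, 0]
```

A function that depends on only a few coordinates (a junta) keeps its stability high for moderate rho, which is the intuition behind treating noise stability as a simplicity metric.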
Noise Sensitivity and Stability of Deep Neural Networks for Binary Classification
Jonasson, Johan, Steif, Jeffrey E., Zetterqvist, Olof
The driving question of this paper is how robust a typical binary neural net classifier is to input noise: for a typical classifier and a typical input, will tiny changes to that input make the classifier change its mind? When asking this, we take inspiration from phenomena observed for deep neural networks (DNNs) used in practice, and use that inspiration to give mathematically rigorous answers for some simple DNN models under one (of several possible) reasonable interpretations of the question. Familiarity with DNNs is not a prerequisite for finding the topic interesting, and any machine-learning lingo will be explained shortly. DNNs have shown results that range from good to staggering in many different data-driven areas, e.g. prediction and classification. One of many reasons for this is that with sufficiently large models, neural networks can approximate any function [5].
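The question can be phrased quantitatively as noise sensitivity: the probability that flipping each input coordinate independently with probability delta changes the predicted class. A toy sketch, with a random one-hidden-layer net standing in for a "typical" classifier (all names, sizes, and the weight distribution here are illustrative assumptions, not the paper's models):

```python
import numpy as np

def random_relu_classifier(n_in=20, n_hidden=64, seed=0):
    """A random one-hidden-layer binary classifier: a hypothetical
    stand-in for a 'typical' DNN with Gaussian weights."""
    g = np.random.default_rng(seed)
    W = g.normal(size=(n_in, n_hidden)) / np.sqrt(n_in)
    v = g.normal(size=n_hidden) / np.sqrt(n_hidden)
    return lambda x: np.sign(np.maximum(x @ W, 0.0) @ v)

def noise_sensitivity(clf, n_in, delta, n=20000, seed=1):
    """P[clf(x) != clf(y)], where y flips each coordinate of x
    independently with probability delta."""
    r = np.random.default_rng(seed)
    x = r.choice([-1.0, 1.0], size=(n, n_in))
    flip = r.random((n, n_in)) < delta
    y = np.where(flip, -x, x)
    return float(np.mean(clf(x) != clf(y)))

clf = random_relu_classifier()
```

Sensitivity is zero at delta = 0 and grows with delta; how fast it grows for typical random nets is exactly the kind of question the paper answers rigorously.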
Understanding Influence Functions and Datamodels via Harmonic Analysis
Saunshi, Nikunj, Gupta, Arushi, Braverman, Mark, Arora, Sanjeev
It is often of great interest to quantify how the presence or absence of a particular training data point affects the trained model's performance on test data points. Influence functions are a classical idea for this [Jaeckel, 1972, Hampel, 1974, Cook, 1977] that has recently been adapted to modern deep models and large datasets [Koh and Liang, 2017]. Influence functions have been applied to explain predictions and produce confidence intervals [Schulam and Saria, 2019], investigate model bias [Brunet et al., 2019, Wang et al., 2019], estimate Shapley values [Jia et al., 2019, Ghorbani and Zou, 2019], improve human trust [Zhou et al., 2019], and craft data poisoning attacks [Koh et al., 2019]. Influence admits several formalizations. The classic calculus-based estimate (henceforth referred to as continuous influence) conceptualizes the training loss as a weighted sum over training datapoints, where the weight of a particular datapoint z can be varied infinitesimally.
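The continuous-influence idea can be made concrete in one dimension by taking (for illustration only) the estimator to be the minimizer of a weighted squared loss, i.e. the weighted mean, whose classical influence function is IF(z) = z - mu:

```python
import numpy as np

def estimator(data, weights):
    # Minimizer of sum_i w_i * (theta - z_i)^2: the weighted mean.
    return np.average(data, weights=weights)

def continuous_influence(data, idx, eps=1e-6):
    """Finite-difference version of the calculus-based ("continuous")
    influence: the derivative of the estimator as datapoint idx is
    upweighted infinitesimally (np.average renormalizes weights)."""
    n = len(data)
    w = np.full(n, 1.0 / n)
    w_up = w.copy()
    w_up[idx] += eps
    return (estimator(data, w_up) - estimator(data, w)) / eps

data = np.array([1.0, 2.0, 3.0, 6.0])  # sample mean is 3.0
```

Upweighting the outlier 6.0 should move the mean with influence close to 6.0 - 3.0 = 3.0, matching the closed form IF(z) = z - mu; for deep models the same derivative is approximated with Hessian-vector products rather than retraining.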
Using noise resilience for ranking generalization of deep neural networks
Morwani, Depen, Vashisht, Rahul, Ramaswamy, Harish G.
Recent papers have shown that sufficiently overparameterized neural networks can perfectly fit even random labels. It is therefore crucial to understand the underlying reason behind a network's generalization performance on real-world data. In this work, we propose several measures to predict the generalization error of a network given the training data and its parameters. Using one of these measures, based on the noise resilience of the network, we secured 5th position in the Predicting Generalization in Deep Learning (PGDL) competition at NeurIPS 2020.
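One simple way to turn "noise resilience" into a number (a sketch only; the competition measures are more involved) is the average accuracy drop when Gaussian noise is added to the parameters, here shown for a hypothetical linear classifier:

```python
import numpy as np

def linear_predict(params, X):
    (w,) = params
    return np.sign(X @ w)

def noise_resilience_drop(predict, params, X, y, sigma=0.5, trials=20, seed=0):
    """Average accuracy drop when Gaussian noise of scale sigma is
    added to the parameters; a smaller drop means the network sits
    in a flatter, more noise-resilient region."""
    rng = np.random.default_rng(seed)
    base = np.mean(predict(params, X) == y)
    drops = []
    for _ in range(trials):
        noisy = [p + sigma * rng.normal(size=p.shape) for p in params]
        drops.append(base - np.mean(predict(noisy, X) == y))
    return float(np.mean(drops))

# Toy separable data (illustrative): labels from a known weight vector.
rng = np.random.default_rng(2)
X = rng.choice([-1.0, 1.0], size=(200, 11))
y = np.sign(X @ np.ones(11))
```

Scaling the weights up leaves the predictions unchanged but shrinks the relative effect of fixed-scale parameter noise, so the large-margin copy of the same classifier scores as more resilient.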
Explaining Landscape Connectivity of Low-cost Solutions for Multilayer Nets
Kuditipudi, Rohith, Wang, Xiang, Lee, Holden, Zhang, Yi, Li, Zhiyuan, Hu, Wei, Arora, Sanjeev, Ge, Rong
Efforts to understand how and why deep learning works have led to a focus on the optimization landscape of training loss. Since optimization to near-zero training loss occurs for many choices of random initialization, it is clear that the landscape contains many global optima (or near-optima). However, the loss can become quite high when interpolating between found optima, suggesting that these optima occur at the bottom of "valleys" surrounded on all sides by high walls. Therefore the phenomenon of mode connectivity (Garipov et al., 2018; Draxler et al., 2018) came as a surprise: optima (at least the ones discovered by gradient-based optimization) are connected by simple paths in the parameter space, on which the loss function is almost constant. In other words, the optima are not walled off in separate valleys as hitherto believed.
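Mode connectivity is commonly quantified by the loss barrier along the straight line between two solutions: a near-zero barrier means the optima are (at least linearly) connected rather than walled off in separate valleys. A minimal sketch with toy loss functions (the losses and names here are illustrative, not the paper's networks):

```python
import numpy as np

def loss_barrier(loss_fn, theta_a, theta_b, n_points=21):
    """Max loss along the linear path between two parameter vectors,
    minus the worse of the two endpoints. A large positive barrier
    means a 'wall' between the optima; ~0 means they are connected."""
    ts = np.linspace(0.0, 1.0, n_points)
    path = [loss_fn((1.0 - t) * theta_a + t * theta_b) for t in ts]
    return max(path) - max(path[0], path[-1])

convex_loss = lambda th: float(np.sum(th ** 2))              # one valley
ring_loss = lambda th: float((np.sum(th ** 2) - 1.0) ** 2)   # minima on the unit circle
```

For the convex loss the barrier is zero by convexity, while for the ring loss the straight line between two antipodal minima passes through the origin, where the loss rises to 1; real loss landscapes avoid such barriers along curved, not straight, paths.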
Identity Connections in Residual Nets Improve Noise Stability
Residual Neural Networks (ResNets) achieve state-of-the-art performance in many computer vision problems. Compared to plain networks without residual connections (PlnNets), ResNets train faster, generalize better, and suffer less from the so-called degradation problem. We introduce simplified (but still nonlinear) versions of ResNets and PlnNets for which these discrepancies still hold, although to a lesser degree. We establish a one-to-one mapping between simplified ResNets and simplified PlnNets, and show that they are exactly equivalent to each other in expressive power at the same computational complexity. We conjecture that ResNets generalize better because they have better noise stability, and support this conjecture empirically for both simplified and fully-fledged networks.
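The measurement behind this conjecture can be sketched by comparing the relative output change of a plain versus a residual ReLU stack under small input noise. The architectures and scales below are illustrative, not the paper's simplified models, and the sketch only sets up the measurement; whether the residual variant changes less is the empirical question:

```python
import numpy as np

def stack(x, Ws, residual):
    """Apply a stack of ReLU layers; residual=True adds identity
    connections, mapping h -> h + relu(h @ W)."""
    h = x
    for W in Ws:
        z = np.maximum(h @ W, 0.0)
        h = h + z if residual else z
    return h

def relative_output_change(Ws, residual, sigma, n=500, d=32, seed=0):
    """||f(x + eps) - f(x)|| / ||f(x)|| under Gaussian input noise of
    scale sigma: a simple proxy for (lack of) noise stability."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, d))
    eps = sigma * rng.normal(size=(n, d))
    y0 = stack(x, Ws, residual)
    y1 = stack(x + eps, Ws, residual)
    return float(np.linalg.norm(y1 - y0) / np.linalg.norm(y0))

# Shared random weights so the two architectures are directly comparable.
wrng = np.random.default_rng(3)
Ws = [wrng.normal(size=(32, 32)) / np.sqrt(32) for _ in range(4)]
```

Because the weights are shared, any difference between `relative_output_change(Ws, True, sigma)` and `relative_output_change(Ws, False, sigma)` is attributable to the identity connections alone.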